The Design of Terra: Harnessing the Best Features of High-Level and Low-Level Languages
Applications are often written using a combination of high-level and low-level languages, since this allows performance-critical parts to be carefully optimized while other parts can be written more productively. This approach is used in web development, game programming, and in build systems for applications themselves. However, most languages were not designed with interoperability in mind, resulting in glue code and duplicated features that add complexity. We propose a two-language system where both languages were designed to interoperate. Lua is used for our high-level language since it was originally designed with interoperability in mind. We create a new low-level language, Terra, that we designed to interoperate with Lua. It is embedded in Lua, and meta-programmed from it, but has a low level of abstraction suited for writing high-performance code. We discuss important design decisions - compartmentalized runtimes, glue-free interoperation, and meta-programming features - that enable Lua and Terra to be more powerful than the sum of their parts.
The unexplained nature of reading.
The effects of properties of words on their reading aloud response times (RTs) are one major source of evidence about the reading process. The precision with which such RTs could potentially be predicted by word properties is critical to evaluate our understanding of reading but is often underestimated due to contamination from individual differences. We estimated this precision without such contamination individually for 4 people who each read 2,820 words 50 times each. These estimates were compared to the precision achieved by a 31-variable regression model that outperforms current cognitive models on variance-explained criteria. Most (around 2/3) of the meaningful (non-first-phoneme, non-noise) word-level variance remained unexplained by this model. Considerable empirical and theoretical-computational effort has been expended on this area of psychology, but the high level of systematic variance remaining unexplained suggests doubts regarding contemporary accounts of the details of the mechanisms of reading at the level of the word. Future assessment of models can take advantage of the availability of our precise participant-level database.
Opt: A Domain Specific Language for Non-linear Least Squares Optimization in Graphics and Imaging
Many graphics and vision problems can be expressed as non-linear least
squares optimizations of objective functions over visual data, such as images
and meshes. The mathematical descriptions of these functions are extremely
concise, but their implementation in real code is tedious, especially when
optimized for real-time performance on modern GPUs in interactive applications.
In this work, we propose a new language, Opt (available under
http://optlang.org), for writing these objective functions over image- or
graph-structured unknowns concisely and at a high level. Our compiler
automatically transforms these specifications into state-of-the-art GPU solvers
based on Gauss-Newton or Levenberg-Marquardt methods. Opt can generate
different variations of the solver, so users can easily explore tradeoffs in
numerical precision, matrix-free methods, and solver approaches. In our
results, we implement a variety of real-world graphics and vision applications.
Their energy functions are expressible in tens of lines of code, and produce
highly-optimized GPU solver implementations. These solvers have performance
competitive with the best published hand-tuned, application-specific GPU
solvers, and orders of magnitude beyond a general-purpose auto-generated
solver.
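To make the Gauss-Newton method named in this abstract concrete, here is a minimal CPU sketch in Python. It is not Opt's syntax or its generated solver: the exponential model, the function name `gauss_newton`, and the 2x2 normal-equation solve are all assumptions chosen for illustration.

```python
import math


def gauss_newton(xs, ys, a, b, iterations=50, tol=1e-12):
    """Fit y = a * exp(b * x) by plain (undamped) Gauss-Newton.

    Hypothetical example model; returns the fitted (a, b).
    """
    for _ in range(iterations):
        # Accumulate J^T J (symmetric 2x2) and J^T r (2-vector).
        jtj00 = jtj01 = jtj11 = 0.0
        jtr0 = jtr1 = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            r = a * e - y      # residual for this sample
            j0 = e             # d(residual) / d(a)
            j1 = a * x * e     # d(residual) / d(b)
            jtj00 += j0 * j0
            jtj01 += j0 * j1
            jtj11 += j1 * j1
            jtr0 += j0 * r
            jtr1 += j1 * r
        det = jtj00 * jtj11 - jtj01 * jtj01
        if abs(det) < 1e-18:
            break  # normal equations are (near) singular; stop
        # Solve (J^T J) * delta = -(J^T r) with Cramer's rule.
        da = (-jtr0 * jtj11 + jtr1 * jtj01) / det
        db = (-jtr1 * jtj00 + jtr0 * jtj01) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b


# Usage: recover a = 2, b = 0.5 from noise-free samples.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a_fit, b_fit = gauss_newton(xs, ys, a=1.5, b=0.3)
```

Levenberg-Marquardt, the other method the abstract mentions, adds a damping term to the diagonal of J^T J before the solve, which makes the step more robust when the starting guess is poor.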
MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems
Training and deploying large machine learning (ML) models is time-consuming
and requires significant distributed computing infrastructures. Based on
real-world large model training on datacenter-scale infrastructures, we show
14-32% of all GPU hours are spent on communication with no overlapping
computation. To minimize the outstanding communication latency, in this work,
we develop an agile performance modeling framework to guide parallelization and
hardware-software co-design strategies. Using the suite of real-world large ML
models on state-of-the-art GPU training hardware, we demonstrate 2.24x and
5.27x throughput improvement potential for pre-training and inference
scenarios, respectively.
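As a back-of-the-envelope reading of the exposed-communication figures, the sketch below computes an Amdahl-style upper bound on the speedup available from hiding that communication alone. It assumes the reported share of GPU hours equals the share of wall-clock time, and the helper name `speedup_bound` is hypothetical. This is not the paper's performance model.

```python
def speedup_bound(exposed_fraction):
    """Upper bound on speedup if all exposed communication were hidden.

    Assumes the rest of the run time is unchanged (Amdahl-style bound).
    """
    if not 0.0 <= exposed_fraction < 1.0:
        raise ValueError("exposed_fraction must be in [0, 1)")
    return 1.0 / (1.0 - exposed_fraction)


for f in (0.14, 0.32):
    print(f"{f:.0%} exposed -> at most {speedup_bound(f):.2f}x")
# prints:
# 14% exposed -> at most 1.16x
# 32% exposed -> at most 1.47x
```

The larger 2.24x and 5.27x figures in the abstract come from the paper's own framework and are not reproduced by this simple bound.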
Just-in-time Length Specialization of Dynamic Vector Code
Dynamically typed vector languages are popular in data analytics and statistical computing. In these languages, vectors have both dynamic type and dynamic length, making static generation of efficient machine code difficult. In this paper, we describe a trace-based just-in-time compilation strategy that performs partial length specialization of dynamically typed vector code. This selective specialization is designed to avoid excessive compilation overhead while still enabling the generation of efficient machine code through length-based optimizations such as vector fusion, vector copy elimination, and the use of hardware SIMD units. We have implemented our approach in a virtual machine for a subset of R, a vector-based statistical computing language. In a variety of workloads, containing both scalar and vector code, we show near autovectorized C performance over a large range of vector sizes.
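The general idea of specializing on a recorded vector length and fusing element-wise operations can be illustrated with a toy Python sketch. Every name here (`make_fused_kernel`, `LengthSpecializingCache`) is hypothetical, and Python closures stand in for generated machine code. It does not reproduce the paper's trace-based strategy or its rules for choosing which lengths to specialize.

```python
def make_fused_kernel(n):
    """Build a kernel for a * b + c specialized to vectors of length n."""
    def kernel(a, b, c):
        # Guard: the specialization is only valid for the recorded length.
        if not (len(a) == len(b) == len(c) == n):
            raise ValueError("length guard failed")
        # Fused loop: no intermediate vector for a * b is materialized.
        return [a[i] * b[i] + c[i] for i in range(n)]
    return kernel


class LengthSpecializingCache:
    """Toy stand-in for a JIT code cache keyed on vector length."""

    def __init__(self):
        self.kernels = {}
        self.compilations = 0  # how many kernels were "compiled"

    def run(self, a, b, c):
        n = len(a)
        kernel = self.kernels.get(n)
        if kernel is None:
            # First time this length is seen: build and cache a kernel.
            kernel = make_fused_kernel(n)
            self.kernels[n] = kernel
            self.compilations += 1
        return kernel(a, b, c)


cache = LengthSpecializingCache()
result = cache.run([1, 2, 3], [4, 5, 6], [7, 8, 9])  # builds the n = 3 kernel
again = cache.run([0, 0, 0], [1, 1, 1], [2, 2, 2])   # reuses it
```

Caching one kernel per observed length shows the trade-off the abstract points at: specializing on every length would maximize optimization opportunities but also compilation overhead, which is why the paper specializes selectively.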